Fast Text Processing for Information Retrieval
Abstract
We describe an advanced text processing system for information retrieval from natural language document collections. We use both syntactic processing and statistical term clustering to obtain a representation of documents that is more accurate than those obtained with more traditional keyword methods. A reliable top-down parser has been developed that allows for fast processing of large amounts of text and for precise identification of the desired types of phrases for statistical analysis. Two statistical measures are computed: the measure of informational contribution of words in phrases, and the similarity measure between words.

APPROXIMATE PARSING WITH TTP

TTP (Tagged Text Parser) is a top-down English parser specifically designed for fast, reliable processing of large amounts of text. The parser operates on tagged input, where each word has been marked with a tag indicating a syntactic category: a part of speech with selected morphological features such as number, tense, mode, case, and degree (see footnote 1). As an example, consider the following sentence from an article appearing in the Communications of the ACM:

The binary number system offers many advantages over a decimal representation for a high-performance, general-purpose computer.

This sentence is tagged as follows (we show the best-tags option only; dt determiner, nn singular noun, nns plural noun, in preposition, jj adjective, vbz verb in present tense third person singular):

[[the,dt],[binary,jj],[number,nn],[system,nn],[offers,vbz],[many,jj],[advantages,nns],[over,in],[a,dt],[decimal,jj],[representation,nn],[for,in],[a,dt],[high_performance,nn],[comS,comS],[general_purpose,nn],[computer,nn],[perS,perS]]

Tagging of the input text substantially reduces the search space of a top-down parser since it resolves many lexical ambiguities, such as singular verb vs. plural noun, past tense vs. past participle, or preposition vs. wh-determiner.
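The effect of tagging on the search space can be illustrated with a small Python sketch. It is not part of TTP (which is written in Prolog); the mini-lexicon `LEXICON` and the helper names are hypothetical and chosen only for illustration. The sketch counts how many tag sequences a parser would have to consider for the first seven words of the example sentence when each word keeps all of its possible tags, and compares that with the single sequence left after best-tags tagging.

```python
from math import prod

# Hypothetical mini-lexicon: the tags each word could carry out of context.
# These entries are illustrative assumptions, not data from the paper.
LEXICON = {
    "the": {"dt"},
    "binary": {"jj", "nn"},
    "number": {"nn", "vb"},
    "system": {"nn"},
    "offers": {"vbz", "nns"},   # singular verb vs. plural noun
    "many": {"jj"},
    "advantages": {"nns"},
}

# Best-tags tagger output for the same words, as (word, tag) pairs.
TAGGED = [
    ("the", "dt"), ("binary", "jj"), ("number", "nn"), ("system", "nn"),
    ("offers", "vbz"), ("many", "jj"), ("advantages", "nns"),
]


def untagged_readings(words, lexicon):
    """Number of distinct tag sequences when every word keeps all its tags."""
    return prod(len(lexicon[word]) for word in words)


def tagged_readings(tagged):
    """With exactly one tag per word, only one tag sequence remains."""
    return 1 if tagged else 0


words = [word for word, _ in TAGGED]
print(untagged_readings(words, LEXICON))  # 8 (three words with two tags each)
print(tagged_readings(TAGGED))            # 1
```

Even in this toy setting the number of lexical readings drops from 8 to 1; with a realistic lexicon and longer sentences the difference grows multiplicatively with each ambiguous word.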
Tagging also helps to reduce the number of parse structures that can be assigned to a sentence, decreases the need to consult the dictionary, and simplifies dealing with unknown words.

1 At present we use the 35-tag Penn Treebank Tagset created at the University of Pennsylvania. Prior to parsing, the text is tagged automatically using a program supplied by Bolt Beranek and Newman. We wish to thank Ralph Weischedel and Marie Meeter of BBN for providing and assisting in the use of the tagger.

TTP is based on the Linguistic String Grammar developed by Sager [8] and partially incorporated in the Proteus parser [3]. TTP is written in Quintus Prolog, and currently implements more than 400 grammar productions. The restriction component of the original LSP Grammar as well as the lambda-reduction based "semantics" of the Proteus implementation have been redesigned for the unification-based environment (see footnote 2).

TTP produces a regularized representation of each parsed sentence that reflects the sentence's logical structure. This representation may differ considerably from a standard parse tree, in that the constituents get moved around (e.g., de-passivization, de-dativization), and certain noun phrases get transformed into equivalent clauses (de-nominalization). The aim is to produce a uniform representation across different paraphrases; for example, the phrase "context-free language recognition or parsing" is represented as shown below:

[[verb,[or,[recognize,parse]]],
 [subject,anyone],
 [object,[np,[n,language],[adj,[context_free]]]]].

The parser is equipped with a time-out mechanism that allows for fast closing of more difficult sub-constituents after a preset amount of time has elapsed without producing a parse.
When the time-out option is turned on (which happens automatically during parsing), the parser is permitted to skip portions of the input to reach a starter terminal for the next constituent to be parsed, closing the currently open constituent (or constituents) with whatever partial representation has been generated thus far. The result is an approximate partial parse, which shows the overall structure of the sentence but from which some of the constituents may be missing. Since the time-out option can be regulated by setting an appropriate flag before parsing starts, the parser may be tuned to reach an acceptable compromise between speed and precision.

The time-out mechanism is implemented using straightforward parameter passing and is at present limited to only a subset of the nonterminals used by the grammar. Suppose that X is such a nonterminal, and that it occurs on the right-hand side of a production S → X Y Z. The set of "starters" is computed for Y, which consists of the word tags that can occur as the left-most

2 See [10] for details.
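The skipping behaviour described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not TTP's Prolog implementation: the toy grammar `GRAMMAR`, the functions `starters` and `skip_to_starter`, and the point at which the time-out fires are all hypothetical. The sketch also assumes that the starters of a nonterminal are the word tags that can begin a phrase of that type, and that the grammar has no empty productions.

```python
# Toy grammar (hypothetical, not from the paper). Keys are nonterminals;
# any symbol that is not a key is treated as a word tag.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["dt", "NBAR"], ["NBAR"]],
    "NBAR": [["jj", "NBAR"], ["nn", "NBAR"], ["nn"], ["nns"]],
    "VP": [["vbz", "NP"]],
}


def starters(symbol, grammar, _seen=None):
    """Word tags that can begin a phrase of type `symbol`.

    Assumes no empty productions, so only the first symbol of each
    right-hand side matters.
    """
    seen = set() if _seen is None else _seen
    if symbol not in grammar:      # a word tag starts itself
        return {symbol}
    if symbol in seen:             # guard against left recursion
        return set()
    seen.add(symbol)
    tags = set()
    for rhs in grammar[symbol]:
        tags |= starters(rhs[0], grammar, seen)
    return tags


def skip_to_starter(tagged, position, next_starters):
    """Index of the first token at or after `position` whose tag is a starter.

    Returns len(tagged) if no such token exists.
    """
    while position < len(tagged) and tagged[position][1] not in next_starters:
        position += 1
    return position


TAGGED = [
    ("the", "dt"), ("binary", "jj"), ("number", "nn"), ("system", "nn"),
    ("offers", "vbz"), ("many", "jj"), ("advantages", "nns"),
]

# Suppose the parser times out inside NP (the X of S -> NP VP) after
# consuming two tokens. NP is closed with what has been built so far,
# and input is skipped up to the first starter of VP (the Y).
consumed = 2
resume = skip_to_starter(TAGGED, consumed, starters("VP", GRAMMAR))
partial = [word for word, _ in TAGGED[:consumed]]
skipped = [word for word, _ in TAGGED[consumed:resume]]
print(partial, skipped, TAGGED[resume][0])
# ['the', 'binary'] ['number', 'system'] offers
```

In this example the subject is closed as the partial phrase "the binary", the tokens "number system" are skipped, and parsing resumes at the verb "offers", which yields the kind of approximate partial parse the text describes.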
Publication date: 1991